From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.
We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.
Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,
$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$

where $N$ is the number of cases in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, each predicted probability $p$ is replaced with $\max(\min(p, 1-10^{-15}), 10^{-15})$.
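As an illustration of how this scoring behaves, here is a minimal sketch of the metric in NumPy. The function name and arguments are hypothetical, and the rescale-then-clip order follows the description above rather than Kaggle's actual implementation:

import numpy as np

def multiclass_logloss(y_true, probs):
    # y_true: (N,) integer class indices; probs: (N, M) predicted probabilities
    probs = np.asarray(probs, dtype=float)
    # Rescale each row to sum to one, since submissions are not required to
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip to avoid log(0): p -> max(min(p, 1 - 1e-15), 1e-15)
    probs = np.clip(probs, 1e-15, 1 - 1e-15)
    # Average the negative log of the probability assigned to the true class
    rows = np.arange(len(y_true))
    return -np.mean(np.log(probs[rows, y_true]))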
You must submit a csv file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter. The file must have a header and should look like the following:
Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.9,0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
... etc.
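As a sketch, a uniform-probability baseline in this format could be assembled with pandas as follows; `classes` and `test_ids` are hypothetical stand-ins for the 39 category names above and the Ids from test.csv:

import pandas as pd

# Hypothetical stand-ins: in practice, use all 39 category names from the
# header above and the Id column of test.csv.
classes = ['ARSON', 'ASSAULT', 'BAD CHECKS']  # ...and the remaining 36
test_ids = [0, 1, 2]

# Every class gets the same probability; rows are rescaled at scoring time
# anyway, so a uniform row is a valid (if weak) baseline.
submission = pd.DataFrame(1.0 / len(classes), index=test_ids, columns=classes)
submission.index.name = 'Id'
submission.to_csv('submission.csv')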
File Name | Available Formats |
---|---|
test.csv | .zip (18.75 mb) |
sampleSubmission.csv | .zip (2.38 mb) |
train.csv | .zip (22.09 mb) |
This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week: odd-numbered weeks (1, 3, 5, 7, ...) belong to the test set, and even-numbered weeks (2, 4, 6, 8, ...) belong to the training set.
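As a rough sanity check of that rotation (a sketch only: it assumes weeks are counted from 1/1/2003, which the description does not spell out), one could inspect the week-number parity once train.csv has been loaded as train_data, as in the notebook below:

import pandas as pd

# Assumed convention: week 1 starts on 1/1/2003. If the rotation holds,
# training incidents should fall almost entirely in even-numbered weeks.
dates = pd.to_datetime(train_data.Dates)
week_number = (dates - pd.Timestamp('2003-01-01')).dt.days // 7 + 1
print(week_number.mod(2).value_counts())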
Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident
X - Longitude
Y - Latitude
In [1]:
import pandas as pd
import zipfile
# Read the training set straight from the zip archive
archive = zipfile.ZipFile("C:/Users/vutran/Desktop/github/kaggle/San Francisco Crime Classification/data/train.csv.zip", 'r')
train_data = pd.read_csv(archive.open("train.csv"))
train_data.head()
Out[1]:
In [2]:
train_data.tail()
Out[2]:
In [3]:
train_data.dtypes
Out[3]:
In [4]:
train_data.info()
In [11]:
pd.unique(train_data.Category)
Out[11]:
In [12]:
pd.unique(train_data.Category).shape
Out[12]:
Now that we have a general idea of the dataset, we next clean and transform the data to create useful features for machine learning.
The Dates feature contains both a date and a time of day; I shall use only the hour.
In [51]:
# Extract the hour of day from the Dates timestamp
feature_hour = pd.to_datetime(train_data.Dates).dt.hour
pd.unique(feature_hour)
Out[51]:
In [ ]:
dow = {
    'Monday': 0,
    'Tuesday': 1,
    'Wednesday': 2,
    'Thursday': 3,
    'Friday': 4,
    'Saturday': 5,
    'Sunday': 6,
}
# Map the day names onto integer codes on the training dataframe
train_data['dayofweek'] = train_data.DayOfWeek.map(dow)
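Equivalently, pandas can derive the same codes straight from the timestamp, since its dt.dayofweek convention is also Monday=0 through Sunday=6:

In [ ]:
# Same mapping without a manual dictionary: dt.dayofweek yields
# Monday=0 ... Sunday=6, matching the dow dict above.
train_data['dayofweek'] = pd.to_datetime(train_data.Dates).dt.dayofweek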